AITopics | evaluation plan

Collaborating Authors

evaluation plan

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge

Saha, Swarnadeep, Li, Xian, Ghazvininejad, Marjan, Weston, Jason, Wang, Tianlu

arXiv.org Artificial IntelligenceJan-29-2025

LLM-as-a-Judge models generate chain-of-thought (CoT) sequences intended to capture the step-bystep reasoning process that underlies the final evaluation of a response. However, due to the lack of human annotated CoTs for evaluation, the required components and structure of effective reasoning traces remain understudied. Consequently, previous approaches often (1) constrain reasoning traces to hand-designed components, such as a list of criteria, reference answers, or verification questions and (2) structure them such that planning is intertwined with the reasoning for evaluation. In this work, we propose EvalPlanner, a preference optimization algorithm for Thinking-LLM-as-a-Judge that first generates an unconstrained evaluation plan, followed by its execution, and then the final judgment. In a self-training loop, EvalPlanner iteratively optimizes over synthetically constructed evaluation plans and executions, leading to better final verdicts. Our method achieves a new state-of-the-art performance for generative reward models on RewardBench (with a score of 93.9), despite being trained on fewer amount of, and synthetically generated, preference pairs. Additional experiments on other benchmarks like RM-Bench, JudgeBench, and FollowBenchEval further highlight the utility of both planning and reasoning for building robust LLM-as-a-Judge reasoning models.

evalplanner, large language model, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2501.18099

Country:

North America > United States > Florida > Miami-Dade County > Miami (0.04)
Asia > Thailand > Bangkok > Bangkok (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.97)

Add feedback

"Close...but not as good as an educator." -- Using ChatGPT to provide formative feedback in large-class collaborative learning

Ponte, Cory Dal, Dushyanthen, Sathana, Lyons, Kayley

arXiv.org Artificial IntelligenceNov-2-2023

Delivering personalised, formative feedback to multiple problem-based learning groups in a short time period can be almost impossible. We employed ChatGPT to provide personalised formative feedback in a one-hour Zoom break-out room activity that taught practicing health professionals how to formulate evaluation plans for digital health initiatives. Learners completed an evaluation survey that included Likert scales and open-ended questions that were analysed. Half of the 44 survey respondents had never used ChatGPT before. Overall, respondents found the feedback favourable, described a wide range of group dynamics, and had adaptive responses to the feedback, yet only three groups used the feedback loop to improve their evaluation plans. Future educators can learn from our experience including engineering prompts, providing instructions on how to use ChatGPT, and scaffolding optimal group interactions with ChatGPT. Future researchers should explore the influence of ChatGPT on group dynamics and derive design principles for the use of ChatGPT in collaborative learning.

chatgpt, formative feedback, participant, (13 more...)

arXiv.org Artificial Intelligence

2311.01634

Country: Asia > Singapore (0.05)

Genre:

Research Report (1.00)
Questionnaire & Opinion Survey (1.00)
Instructional Material > Course Syllabus & Notes (0.71)

Industry:

Health & Medicine > Health Care Providers & Services (0.91)
Education > Assessment & Standards > Assessment Methods (0.84)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.30)

Add feedback

Branch-Solve-Merge Improves Large Language Model Evaluation and Generation

Saha, Swarnadeep, Levy, Omer, Celikyilmaz, Asli, Bansal, Mohit, Weston, Jason, Li, Xian

arXiv.org Artificial IntelligenceOct-23-2023

Large Language Models (LLMs) are frequently used for multi-faceted language generation and evaluation tasks that involve satisfying intricate user constraints or taking into account multiple aspects and criteria. However, their performance can fall short, due to the model's lack of coherence and inability to plan and decompose the problem. We propose Branch-Solve-Merge (BSM), a Large Language Model program (Schlag et al., 2023) for tackling such challenging natural language tasks. It consists of branch, solve, and merge modules that are parameterized with specific prompts to the base LLM. These three modules plan a decomposition of the task into multiple parallel sub-tasks, independently solve them, and fuse the solutions to the sub-tasks. We apply our method to the tasks of LLM response evaluation and constrained text generation and evaluate its effectiveness with multiple LLMs, including Vicuna, LLaMA-2-chat, and GPT-4. BSM improves the evaluation correctness and consistency for each LLM by enhancing human-LLM agreement by up to 26%, reducing length and pairwise position biases by up to 50%, and allowing LLaMA-2-chat to match or outperform GPT-4 on most domains. On the constraint story generation task, BSM improves the coherence of the stories while also improving constraint satisfaction by 12%.

arxiv preprint arxiv, bsm, evaluation, (12 more...)

arXiv.org Artificial Intelligence

2310.15123

Country:

North America > United States > Hawaii (0.05)
North America > Canada > Ontario > Toronto (0.04)
Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
Asia > Indonesia > Bali (0.04)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Precomputing Datalog evaluation plans in large-scale scenarios

Fiorentino, Alessio, Leone, Nicola, Manna, Marco, Perri, Simona, Zangari, Jessica

arXiv.org Artificial IntelligenceJul-29-2019

In this scenario, to reduce memory consumption and possibly optimize execution times, the paper proposes novel techniques to determine an optimal indexing schema for the underlying database together with suitable body-orderings for the Datalog rules. The new approach is compared with the standard execution plans implemented in DL V over widely used ontological benchmarks. The results confirm that the memory usage can be significantly reduced without paying any cost in efficiency. This paper is under consideration in Theory and Practice of Logic Programming (TPLP). KEYWORDS: Datalog; Query Answering; Ontologies; Query-plan; Data Indexing 1 Introduction Ontological reasoning services represent fundamental features in the development of the Semantic Web. Among them, scientists are focusing their attention on the so-called ontology-based query answering (OBQA), where a Boolean query has to be evaluated against a logical theory (knowledge base) consisting of an extensional database paired with an ontology (Cal ı et al. 2009; Ortiz 2013; Amendola et al. 2018).

artificial intelligence, evaluation plan, logic & formal reasoning, (15 more...)

arXiv.org Artificial Intelligence

1907.12495

Country: Europe (0.29)

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Ontologies (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Logic & Formal Reasoning (1.00)

Add feedback